
[SYCL] Fix batched impl for NVidia GPU #6164

Merged 3 commits into ggerganov:master on Mar 27, 2024

Conversation

@AidanBeltonS (Contributor) commented Mar 19, 2024

This PR fixes the failing batched_gemm call. The open-source oneMKL does not currently support mixing half- and single-precision floating-point types, so this change uses halves throughout the computation, which resolves the failing tests in the NVIDIA build.

I am currently adding support for these dtype combinations to oneMKL, after which this change can be reverted.

Tested on A100 and PVC
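
For reference, a minimal sketch of what an all-half strided gemm_batch call through the open-source oneMKL interfaces looks like (illustrative only, not the code in this PR; the queue setup, matrix sizes, and strides are assumptions):

```cpp
// Sketch: strided batched GEMM via the open-source oneMKL interfaces using
// sycl::half uniformly for A, B, C, alpha and beta, i.e. the type combination
// that currently works on the NVIDIA (cuBLAS) backend. Illustrative only.
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <cstdint>

int main() {
    sycl::queue q{sycl::default_selector_v};

    const std::int64_t m = 64, n = 64, k = 64, batch = 8;
    const std::int64_t lda = m, ldb = k, ldc = m;
    const std::int64_t stride_a = lda * k, stride_b = ldb * n, stride_c = ldc * n;

    // Device USM buffers holding the whole batch back to back.
    sycl::half *a = sycl::malloc_device<sycl::half>(stride_a * batch, q);
    sycl::half *b = sycl::malloc_device<sycl::half>(stride_b * batch, q);
    sycl::half *c = sycl::malloc_device<sycl::half>(stride_c * batch, q);
    // (Initialization of a and b is omitted in this sketch.)

    const sycl::half alpha = 1.0f, beta = 0.0f;  // scalars are half too, not float

    oneapi::mkl::blas::column_major::gemm_batch(
        q,
        oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
        m, n, k,
        alpha, a, lda, stride_a,
               b, ldb, stride_b,
        beta,  c, ldc, stride_c,
        batch).wait();

    sycl::free(a, q);
    sycl::free(b, q);
    sycl::free(c, q);
    return 0;
}
```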

@OuadiElfarouki (Collaborator) left a comment

Thanks for the fix. LGTM!

@NeoZhangJianyu (Collaborator) commented Mar 20, 2024

> This PR fixes the failing batched_gemm call. The open-source oneMKL does not currently support mixing half- and single-precision floating-point types, so this change uses halves throughout the computation, which resolves the failing tests in the NVIDIA build.
>
> I am currently adding support for these dtype combinations to oneMKL, after which this change can be reverted.
>
> Tested on A100 and PVC

This reverts my code change from #5895. My PR was meant to increase performance on Intel GPUs, because fp32 performed better than fp16 in my tests. It also made the CI/unit test pass - smaller NMS (I can't remember clearly).

I think this PR will reduce performance on Intel GPUs.

  1. Please verify the performance with llama-bench on Intel Arc 770/Flex 170 and the MTL iGPU.
  2. Please run the CI/unit tests locally and make sure they pass 100% on both GPUs.

@NeoZhangJianyu (Collaborator)

The PR mixes fp32 and fp16 to increase the performance of fp32 mode. If the user accepts fp16, why not use fp16 mode directly? It has better performance.

Involving fp16 will reduce accuracy. I don't think it is a good solution for improving fp32 mode performance.

@AidanBeltonS (Contributor, Author)

> The PR mixes fp32 and fp16 to increase the performance of fp32 mode. If the user accepts fp16, why not use fp16 mode directly? It has better performance.
>
> Involving fp16 will reduce accuracy. I don't think it is a good solution for improving fp32 mode performance.

Yes, this will decrease performance for iGPUs. However, oneMKL does not currently support mixed data types for gemm_batch on the NVIDIA and AMD targets, so right now the call asserts with an "unsupported combination" error. I am working on adding mixed-data-type support and hope to have a long-term fix upstreamed in oneMKL.

I propose temporarily taking the less optimal but more generic path, to maintain support for all targets and keep the relevant tests passing. Once oneMKL is updated, we can revert this change and regain the benefits you describe.

@NeoZhangJianyu (Collaborator)

>> The PR mixes fp32 and fp16 to increase the performance of fp32 mode. If the user accepts fp16, why not use fp16 mode directly? It has better performance.
>> Involving fp16 will reduce accuracy. I don't think it is a good solution for improving fp32 mode performance.
>
> Yes, this will decrease performance for iGPUs. However, oneMKL does not currently support mixed data types for gemm_batch on the NVIDIA and AMD targets, so right now the call asserts with an "unsupported combination" error. I am working on adding mixed-data-type support and hope to have a long-term fix upstreamed in oneMKL.
>
> I propose temporarily taking the less optimal but more generic path, to maintain support for all targets and keep the relevant tests passing. Once oneMKL is updated, we can revert this change and regain the benefits you describe.

Great!
Currently, Intel iGPUs are the main target of the SYCL backend.
So, we have to pay attention to the impact on iGPUs.

@NeoZhangJianyu (Collaborator) commented Mar 21, 2024

I think we should provide a good framework/rule for handling this case: different optimizations for different hardware.

Here is my proposal (see the sketch after this list):

  1. Use #ifdef xxx
    Static build: good performance and a clean running code path for troubleshooting.
    Only supports one vendor's devices at a time.
  2. Use if (hardware is xxx)
    Dynamic switch: a slight performance decrease and more code paths to check when troubleshooting.
    Supports mixed-vendor devices at the same time.

Regarding alternative 2: currently the CMake build only supports building SYCL for one vendor's devices, so alternative 2 cannot actually support the mixed-vendor feature.
So it looks like only alternative 1 is available.
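
To make the two alternatives concrete, here is an illustrative sketch; the macro name GGML_SYCL_TARGET_NVIDIA and the helper functions are hypothetical placeholders, not identifiers from the actual build or this PR:

```cpp
// Sketch of the two alternatives. All names (macro, helpers) are hypothetical.
#include <string>
#include <sycl/sycl.hpp>

// Alternative 1: compile-time branch. Only one vendor path exists in the binary,
// so there is no runtime cost, but each build supports a single vendor.
void batched_gemm_static(sycl::queue &q /*, ... matrix arguments ... */) {
    (void)q;  // placeholder; real code would pass q to the chosen path
#ifdef GGML_SYCL_TARGET_NVIDIA
    // gemm_batch_all_half(q, ...);              // hypothetical NVIDIA/AMD path
#else
    // gemm_batch_fp16_in_fp32_out(q, ...);      // hypothetical Intel path (#5895 behaviour)
#endif
}

// Alternative 2: runtime branch. One binary can serve several vendors, at the
// cost of a (cheap) device check and more code paths to validate.
// The vendor-string check is an assumption, not the check llama.cpp uses.
static bool is_nvidia_device(const sycl::queue &q) {
    return q.get_device().get_info<sycl::info::device::vendor>()
               .find("NVIDIA") != std::string::npos;
}

void batched_gemm_dynamic(sycl::queue &q /*, ... matrix arguments ... */) {
    if (is_nvidia_device(q)) {
        // gemm_batch_all_half(q, ...);          // hypothetical NVIDIA/AMD path
    } else {
        // gemm_batch_fp16_in_fp32_out(q, ...);  // hypothetical Intel path
    }
}
```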

Invite @ggerganov @slaren

@slaren (Collaborator) commented Mar 21, 2024

Just some thoughts: my long-term goal is to move backends to dynamic libraries that can be loaded at runtime. Then it will be possible to use multiple backends simultaneously, e.g. CUDA for an NVIDIA device and SYCL for an Intel device at the same time. So there is no need for the SYCL backend to support every vendor if a backend that may provide better performance is already available for that vendor. It would of course be good for the SYCL backend to work with every device, just to have more options, but the effort may be better spent improving performance on Intel devices.

@NeoZhangJianyu (Collaborator)

> Just some thoughts: my long-term goal is to move backends to dynamic libraries that can be loaded at runtime. Then it will be possible to use multiple backends simultaneously, e.g. CUDA for an NVIDIA device and SYCL for an Intel device at the same time. So there is no need for the SYCL backend to support every vendor if a backend that may provide better performance is already available for that vendor. It would of course be good for the SYCL backend to work with every device, just to have more options, but the effort may be better spent improving performance on Intel devices.

Yes, I think the long-term idea is good.
So, alternative 1 is good.

@AidanBeltonS, maybe you could use alternative 1 to handle your code for NVIDIA/AMD.

@abhilash1910 (Collaborator)

I think simple #ifdef macro branching can be quite useful here, keeping the rest of the code performant on iGPUs.

@NeoZhangJianyu (Collaborator)

Mixing fp32 and fp16 still has an accuracy-loss issue.
Is there any test to prove there is no impact on the results?

@AidanBeltonS (Contributor, Author)

I've protected the previous approach with a runtime backend check. I think this is cleaner than using a macro, and checking the backend is a cheap operation, so there should be no noticeable drop in performance.

It has been checked on Intel and NVIDIA GPUs.
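
For context, a minimal sketch of what such a runtime backend check can look like in DPC++ (assuming the ext_oneapi_cuda backend enumerator; not necessarily the exact check added in this PR):

```cpp
// Sketch: detect whether a queue targets the CUDA backend and dispatch accordingly.
#include <sycl/sycl.hpp>

static bool queue_uses_cuda_backend(const sycl::queue &q) {
    // DPC++ exposes the CUDA plugin as sycl::backend::ext_oneapi_cuda.
    return q.get_device().get_backend() == sycl::backend::ext_oneapi_cuda;
}

// Usage idea: keep the previous mixed fp16/fp32 path for Intel devices and take
// the all-half gemm_batch path only where oneMKL lacks mixed-type support.
// if (queue_uses_cuda_backend(q)) { /* all-half gemm_batch */ }
// else                            { /* previous mixed fp16/fp32 path */ }
```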

@AidanBeltonS (Contributor, Author)

> Mixing fp32 and fp16 still has an accuracy-loss issue. Is there any test to prove there is no impact on the results?

All tests are passing with this approach, so any possible precision issues do not seem to be significant.

@abhilash1910 (Collaborator)

This can be merged. The CI failure is not related to SYCL. cc @ggerganov

@abhilash1910 merged commit e82f9e2 into ggerganov:master on Mar 27, 2024. 55 of 56 checks passed.
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* Fix batched impl

* Maintain previous behaviour for igpu

* retrigger CI

---------

Co-authored-by: Abhilash Majumder <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
* Fix batched impl

* Maintain previous behaviour for igpu

* retrigger CI

---------

Co-authored-by: Abhilash Majumder <[email protected]>
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* Fix batched impl

* Maintain previous behaviour for igpu

* retrigger CI

---------

Co-authored-by: Abhilash Majumder <[email protected]>